Abstract

We present a general framework of semi-supervised dimensionality reduction for
manifold learning which naturally generalizes existing supervised and
unsupervised learning frameworks which apply the spectral decomposition.
Algorithms derived under our framework are able to employ both labeled and
unlabeled examples and are able to handle complex problems where data form
separate clusters of manifolds. Our framework offers simple views, explains
relationships among existing frameworks and provides further extensions which
can improve existing algorithms. Furthermore, a new semi-supervised
kernelization framework called "KPCA trick" is proposed to handle non-linear
problems.

Keywords: Semi-supervised Learning, Transductive Learning, Spectral Methods,
Dimensionality Reduction, Manifold Learning, KPCA Trick.
1 Introduction

In many real-world applications, high-dimensional data indeed lie on (or near)
a low-dimensional subspace. The goal of dimensionality reduction is to reduce
the complexity of input data while some desired intrinsic information of the
data is preserved. The desired information can be discriminative
[H 12 13 HI El E], geometrical [2l|8l|9l[l0] or both [11]. Fisher discriminant
analysis (FDA) [12] is the most popular method among all supervised
dimensionality reduction algorithms. Denote $c$ as the number of classes in a
given training set. Provided that training examples of each class lie in a
linear subspace and do not form several separate clusters, i.e. do not form
multi-modality, FDA is able to discover a low-dimensional linear subspace (with
at most $c - 1$ dimensions) which is efficient for classification. Recently,
many works have improved the FDA algorithm in several aspects
[TTl [1] [51 [31 [U [51 [S]. These extended FDA algorithms are able to discover
a nice low-dimensional subspace even when training examples of each class lie
in separate clusters of complicated non-linear manifolds. Moreover, a subspace
discovered by these algorithms is not limited to $c - 1$ dimensions.

Although the extended FDA algorithms work reasonably well, a considerable
number of labeled examples is required to achieve satisfactory performance. In
many real-world applications such as image classification, web page
classification and protein function prediction, a labeling process is costly
and time consuming; in contrast, unlabeled examples can be easily obtained.
Therefore, in such situations, it can be beneficial to incorporate the
information which is contained in unlabeled examples into a learning problem,
i.e., semi-supervised learning (SSL) should be applied instead of supervised
learning [TS].

In this paper, we present a general semi-supervised dimensionality reduction 
framework which is able to employ information from both labeled and unlabeled 
examples. Contributions of the paper can be summarized as follows. 

• Like the extended FDA algorithms, algorithms developed in our framework
are able to discover a nice low-dimensional subspace even when training
examples of each class form separate clusters of complicated non-linear
manifolds. In fact, those previous supervised algorithms can be cast as
instances of our framework. Moreover, our framework explains previously unclear
relationships among existing algorithms from a simple viewpoint.

• We present a novel technique called the Hadamard power operator which 
improves the use of unlabeled examples in previous algorithms. Experiments 
show that the Hadamard power operator improves the classification performance 
of a semi-supervised learner derived from our framework. 

• We show that recent existing semi-supervised frameworks applying spectral
decompositions known to us [14] [15] can be viewed as special cases of our
framework. Moreover, empirical evidence shows that semi-supervised learners 
derived from our framework are superior to existing learners in many standard 
problems. 

• A new non-linearization framework, namely, a KPCA trick framework [16],
is extended into a semi-supervised learning setting. In contrast to the
standard kernel trick, the KPCA trick does not require users to derive new
mathematical formulas and to re-implement the kernel version of the original
learner.

2 The Framework

Let $\{x_i, y_i\}_{i=1}^{\ell}$ denote a training set of $\ell$ labeled
examples, with inputs $x_i \in \mathbb{R}^{d_0}$ generated from a fixed but
unknown probability distribution $\mathcal{P}_X$, and corresponding class
labels $y_i \in \{1, \dots, c\}$ generated from $\mathcal{P}_{Y|X}$. In addition
to the labeled examples, let $\{x_i\}_{i=\ell+1}^{\ell+u}$ denote a set of $u$
unlabeled examples also generated from $\mathcal{P}_X$. Denote
$X \in \mathbb{R}^{d_0 \times (\ell+u)}$ as a matrix of the input examples
$(x_1, \dots, x_{\ell+u})$. We define $n = \ell + u$. The goal of
semi-supervised learning (SSL) dimensionality reduction is

Goal. Using the information of both labeled and unlabeled examples, we want
to map $(x \in \mathbb{R}^{d_0}) \mapsto (z \in \mathbb{R}^{d})$ where
$d < d_0$, such that in the embedded space $\mathcal{P}_{Y|Z}$ can be
accurately estimated (i.e., unknown labels are easy to predict) by a simple
classifier.

Here, following the previous works in the supervised setting [11] [1] [2], the
nearest neighbor algorithm is used for representing a simple classifier
mentioned in the goal. Note that important special cases of SSL problems are
transductive problems where we only want to predict the labels
$\{y_i\}_{i=\ell+1}^{\ell+u}$ of the given unlabeled examples. In order to make
use of unlabeled examples in the learning process, we make the following
so-called manifold assumption [13]:

Semi-Supervised Manifold Assumption. The support of $\mathcal{P}_X$ is on a
low-dimensional manifold. Furthermore, $\mathcal{P}_{Y|X}$ is smooth, as a
function of $x$, with respect to the underlying structure of the manifold.



At first, to fulfill our goal, we linearly parameterize $z_i = A x_i$ where
$A \in \mathbb{R}^{d \times d_0}$. Thus,
$AX = (Ax_1, \dots, Ax_n) \in \mathbb{R}^{d \times n}$ is a matrix of embedded
points. An efficient non-linear extension is presented in Section 2.2. In our
framework, we propose to cast the problem as a constrained optimization
problem:

$$A^* = \arg\min_{A \in \mathcal{A}} f^{\ell}(AX) + \gamma f^{u}(AX), \qquad (1)$$

where $f^{\ell}(\cdot)$ and $f^{u}(\cdot)$ are objective functions based on
labeled and unlabeled examples, respectively, $\gamma$ is a parameter
controlling the weights between the two objective functions and $\mathcal{A}$
is a constraint set in $\mathbb{R}^{d \times d_0}$. The two objective functions
determine "how good the embedded points are"; thus, their arguments are $AX$, a
matrix of embedded points. Up to orthogonal and translational transformations,
we can identify embedded points via their pairwise distances instead of their
individual locations. Therefore, we can base the objective functions on
pairwise distances of embedded examples. Here, we define the objective
functions to be linear with respect to the pairwise distances:

$$f^{\ell}(AX) = \sum_{i,j=1}^{n} c^{\ell}_{ij}\, \mathrm{dist}(Ax_i, Ax_j)
\quad \text{and} \quad
f^{u}(AX) = \sum_{i,j=1}^{n} c^{u}_{ij}\, \mathrm{dist}(Ax_i, Ax_j),$$

where $\mathrm{dist}(\cdot, \cdot)$ is an arbitrary distance function between
two embedded points, and $c^{\ell}_{ij}$ and $c^{u}_{ij}$ are costs which
penalize an embedded distance between two points $i$ and $j$. The
specifications of $c^{\ell}_{ij}$ and $c^{u}_{ij}$ are based on the label
information and unlabel information, respectively, as described in Section 2.1.

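If we restrict ourselves to consider only the cases that (I)
$\mathrm{dist}(\cdot, \cdot)$ is a squared Euclidean distance function, i.e.
$\mathrm{dist}(Ax_i, Ax_j) = \|Ax_i - Ax_j\|^2$, (II) $c^{\ell}_{ij}$ and
$c^{u}_{ij}$ are symmetric, and (III) $A \in \mathcal{A}$ is in the form of
$ABA^T = I$ where $B$ is a positive semidefinite (PSD) matrix, Eq. (1) will
result in a general framework which indeed generalizes previous frameworks as
shown in Section 3. Define $c_{ij} = c^{\ell}_{ij} + \gamma c^{u}_{ij}$. We can
rewrite the weighted combination of the objective functions in Eq. (1) as
follows:

$$\begin{aligned}
f^{\ell}(AX) + \gamma f^{u}(AX)
&= \sum_{i,j=1}^{n} c^{\ell}_{ij}\, \mathrm{dist}(Ax_i, Ax_j)
 + \gamma \sum_{i,j=1}^{n} c^{u}_{ij}\, \mathrm{dist}(Ax_i, Ax_j) \\
&= \sum_{i,j=1}^{n} (c^{\ell}_{ij} + \gamma c^{u}_{ij})\, \mathrm{dist}(Ax_i, Ax_j)
 = \sum_{i,j=1}^{n} c_{ij}\, \mathrm{dist}(Ax_i, Ax_j) \\
&= 2 \Big( \sum_{i,j=1}^{n} c_{ij}\, x_i^T A^T A x_i
 - \sum_{i,j=1}^{n} c_{ij}\, x_i^T A^T A x_j \Big) \\
&= 2\, \mathrm{trace}(AX(D - C)X^T A^T),
\end{aligned}$$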
where $C$ is a symmetric cost matrix with elements $c_{ij}$ and $D$ is a
diagonal matrix with $D_{ii} = \sum_j c_{ij}$.^1 Thus, the optimization problem
(1) can be restated as

$$A^* = \arg\min_{ABA^T = I} \mathrm{trace}(AX(D - C)X^T A^T). \qquad (2)$$
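
The identity behind Eq. (2), namely that the weighted sum of squared embedded distances equals twice the trace term, can be checked numerically. The following is a minimal NumPy sketch and is not part of the original derivation; the function names `pairwise_cost` and `trace_cost` and the random test data are illustrative assumptions.

```python
import numpy as np

def pairwise_cost(A, X, C):
    """Brute-force value of sum_{i,j} c_ij * ||A x_i - A x_j||^2."""
    Z = A @ X                                  # embedded points, one column each
    diff = Z[:, :, None] - Z[:, None, :]       # shape (d, n, n)
    sq = np.sum(diff ** 2, axis=0)             # squared distances, shape (n, n)
    return float(np.sum(C * sq))

def trace_cost(A, X, C):
    """Value of 2 * trace(A X (D - C) X^T A^T) with D_ii = sum_j c_ij."""
    D = np.diag(C.sum(axis=1))
    return float(2.0 * np.trace(A @ X @ (D - C) @ X.T @ A.T))

# Hypothetical random data: d0 = 5 input dimensions, d = 2, n = 8 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
A = rng.normal(size=(2, 5))
C = rng.normal(size=(8, 8))
C = (C + C.T) / 2.0                            # the identity requires a symmetric C

print(pairwise_cost(A, X, C), trace_cost(A, X, C))
```

The two printed values should agree up to floating-point error. The check uses a cost matrix with entries of both signs, since the identity only requires symmetry.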

Note that the constraint $ABA^T = I$ prevents trivial solutions such as every
$Ax_i$ being a zero vector. If $B$ is a positive definite (PD) matrix, a
solution of the above problem is given by the bottom $d$ eigenvectors of the
following generalized eigenvalue problem [12] [17]:

$$X(D - C)X^T a^{(j)} = \lambda_j B a^{(j)}, \quad j = 1, \dots, d. \qquad (3)$$

Then the optimal linear map is

$$A^* = (a^{(1)}, \dots, a^{(d)})^T. \qquad (4)$$

Note that, in terms of solutions of Eq. (3), it is more convenient to represent
$A^*$ by its rows $a^{(j)T}$ than by its columns. Moreover, note that

$$\|z - z'\| = \|A^* x - A^* x'\|. \qquad (5)$$

Therefore, kNN in the embedded space can be performed. Consequently, an
algorithm implemented under our framework consists of three steps as shown
in Figure 1.

Input:  1. training examples: $\{(x_1, y_1), \dots, (x_\ell, y_\ell), x_{\ell+1}, \dots, x_{\ell+u}\}$
        2. a new example: $x'$
        3. a positive-value parameter: $\gamma$
Algorithm:
   (1) Construct cost matrices $C^{\ell}$, $C^{u}$ and $C = C^{\ell} + \gamma C^{u}$,
       and a constraint matrix $B$ (see Section 2.1).
   (2) Obtain an optimal matrix $A^*$ by solving Eq. (3).
   (3) Perform kNN classification in the obtained subspace by using Eq. (5).

Figure 1: Our semi-supervised learning framework.
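
Steps (2) and (3) of Figure 1 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `solve_framework` and `knn_predict` are hypothetical, $B$ is assumed to be positive definite, and the generalized eigenproblem is reduced to a standard symmetric one through a Cholesky factor of $B$, which is one of several possible solvers.

```python
import numpy as np

def solve_framework(X, C, B, d):
    """Return (A, eigenvalues) for the bottom-d solutions of
    X (D - C) X^T a = lambda B a, assuming B is symmetric positive definite."""
    D = np.diag(C.sum(axis=1))
    M = X @ (D - C) @ X.T
    M = (M + M.T) / 2.0                    # remove round-off asymmetry
    L = np.linalg.cholesky(B)              # B = L L^T
    L_inv = np.linalg.inv(L)
    S = L_inv @ M @ L_inv.T                # equivalent standard symmetric problem
    evals, Y = np.linalg.eigh((S + S.T) / 2.0)   # eigenvalues in ascending order
    A = (L_inv.T @ Y[:, :d]).T             # rows are a^(j); satisfies A B A^T = I
    return A, evals[:d]

def knn_predict(A, X_lab, y_lab, x_new, k=1):
    """Majority vote among the k nearest labeled points in the embedded space."""
    Z = A @ X_lab
    z = A @ x_new
    dist = np.linalg.norm(Z - z[:, None], axis=0)
    nearest = np.argsort(dist)[:k]
    labels, counts = np.unique(y_lab[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage with hypothetical random data and B = I (as in DNE).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 10))
C = rng.random((10, 10))
C = (C + C.T) / 2.0
B = np.eye(4)
A, lam = solve_framework(X, C, B, d=2)
```

Because the rows of `A` are built from orthonormal eigenvectors of the reduced problem, the constraint $ABA^T = I$ holds by construction. Step (1), the construction of $C$ and $B$, depends on the chosen learner and is left to the caller here.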



2.1 Specification of the Cost and Constraint Matrices

In this section, we present various reasonable approaches of specifying the two
cost matrices, $C^{\ell}$ and $C^{u}$, and the constraint matrix, $B$, by using
the label and unlabel information. We use the two terms "unlabel information"
and "neighborhood information" interchangeably in this paper.

2.1.1 The Cost Matrix $C^{\ell}$ and the Constraint Matrix $B$

^1 To simplify our notations, in this paper whenever we define a cost matrix
$C'$ having elements $c'_{ij}$, we always define its associated diagonal matrix
$D'$ with elements $D'_{ii} = \sum_j c'_{ij}$.

Figure 2: An example when data form a multi-modal structure. An algorithm,
e.g. FDA, which imposes the condition (1) will try to discover a new subspace
(a dashed line) which merges two clusters A and B together. The obtained
space is undesirable as data of the two classes are mixed together. In contrast,
an algorithm which imposes the condition (1*) (instead of (1)) will discover
a subspace (a thick line) which does not merge the two clusters A and B as
there are no nearby examples (indicated by a link between a pair of examples)
between the two clusters.

Based on the label information, classical supervised algorithms usually
require an embedded space to have the following two desirable conditions:

(1) two examples of the same class stay close to one another, and

(2) two examples of different classes stay far apart.

The two conditions are imposed in classical works such as FDA. However, the 
first condition is too restrictive to capture manifold and multi-modal structures 
of data which naturally arise in some applications. Thus, the first condition 
should be relaxed as follows: 

(1*) two nearby examples of the same class stay close to one another 

where "nearby examples" , defined by using the neighborhood information, are 
examples which should stay close to each other in both original and embedded 
spaces. The specification of "nearby examples" has been proven to be successful 
in discovering manifold and multi-modal structure [TTl [U 1^1 [31 IH 13 El HSl UHl
[inillllHl]. See Figure 2 for explanations. In some cases, it is also appropriate
to relax the second condition to

(2*) two nearby examples of different classes stay far apart.

In this section, we give three examples of cost matrices which satisfy the
conditions (1*) and (2) (or (2*)). These three examples were recently
introduced in previous works, namely, Discriminant Neighborhood Embedding
(DNE) [2], Marginal Fisher Analysis (MFA) [1] and Local Fisher Discriminant
Analysis (LFDA) [11], with different presentations and motivations, but they
can be unified under our general framework.

Firstly, to specify nearby examples, we construct two matrices $C^I$ and $C^E$
based on any sensible distance (Euclidean distance is the simplest choice). For
each $x_i$, let $Neig^I(i)$ be the set of $k$ nearest neighbors having the same
label $y_i$, and let $Neig^E(i)$ be the set of $k$ nearest neighbors having
different labels from $y_i$. Define $C^I$ and $C^E$ as follows: let
$c^I_{ij} = c^E_{ij} = 0$ if points $x_i$ and/or $x_j$ are unlabeled, and

$$c^I_{ij} = \begin{cases} 1, & \text{if } j \in Neig^I(i) \vee i \in Neig^I(j), \\ 0, & \text{otherwise,} \end{cases}
\quad \text{and} \quad
c^E_{ij} = \begin{cases} 1, & \text{if } j \in Neig^E(i) \vee i \in Neig^E(j), \\ 0, & \text{otherwise.} \end{cases}$$
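The specifications $c^I_{ij} = 1$ and $c^E_{ij} = 1$ represent nearby examples
in the conditions (1*) and (2*). Then, $C^{\ell}$ and $B$ of existing
algorithms (see Eq. (2)) are:

Discriminant Neighborhood Embedding (DNE)

$$C^{\ell} = C^I - C^E, \qquad B = I \text{ (an identity matrix)}$$

Marginal Fisher Analysis (MFA)

$$C^{\ell} = -C^E, \qquad B = X(D^I - C^I)X^T$$

Local Fisher Discriminant Analysis (LFDA)

Let $n_1, \dots, n_c$ be the numbers of examples of classes $1, \dots, c$,
respectively. Define matrices $C^{bet}$ and $C^{wit}$ as:

$$c^{bet}_{ij} = \begin{cases} c^I_{ij}\left(\frac{1}{n_k} - \frac{1}{n}\right), & \text{if } y_i = y_j = k, \\ -\frac{1}{n}, & \text{otherwise,} \end{cases}
\qquad
c^{wit}_{ij} = \begin{cases} \frac{c^I_{ij}}{n_k}, & \text{if } y_i = y_j = k, \\ 0, & \text{otherwise.} \end{cases}$$

$$C^{\ell} = C^{bet}, \qquad B = X(D^{wit} - C^{wit})X^T$$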

Within our framework, relationships among the three previous works can be
explained. The three methods exploit different ideas in specifying matrices
$C^{\ell}$ and $B$ to satisfy two desirable conditions in an embedded space. In
DNE, $C^{\ell}$ is designed to penalize an embedded space which does not satisfy
the conditions (1*) and (2*). In MFA, the constraint matrix $B$ is designed to
satisfy the condition (1*) and $C^{\ell}$ is designed to penalize an embedded
space which does not satisfy the condition (2*).
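
The neighbor-based matrices and the DNE and MFA specifications can be sketched as follows. This is a minimal NumPy illustration, not the original authors' code: the function names are hypothetical, Euclidean distance is used, the superscripts $I$ and $E$ denote the same-class and different-class neighbor matrices, and unlabeled points are encoded with the label `-1`, which is purely a convention of this sketch.

```python
import numpy as np

def neighbor_costs(X, y, k):
    """Build the same-class matrix CI and the different-class matrix CE.
    X has shape (d0, n); y has shape (n,), with -1 marking unlabeled points."""
    n = X.shape[1]
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    CI = np.zeros((n, n))
    CE = np.zeros((n, n))
    labeled = np.flatnonzero(y >= 0)
    for i in labeled:
        same = [j for j in labeled if j != i and y[j] == y[i]]
        other = [j for j in labeled if y[j] != y[i]]
        for pool, Cm in ((same, CI), (other, CE)):
            pool = np.array(pool, dtype=int)
            nearest = pool[np.argsort(sq[i, pool])[:k]]
            Cm[i, nearest] = 1.0
            Cm[nearest, i] = 1.0      # the "or" rule makes the matrix symmetric
    return CI, CE

def dne_spec(X, CI, CE):
    """DNE: label cost CI - CE with an identity constraint matrix."""
    return CI - CE, np.eye(X.shape[0])

def mfa_spec(X, CI, CE):
    """MFA: label cost -CE with constraint matrix X (DI - CI) X^T."""
    DI = np.diag(CI.sum(axis=1))
    return -CE, X @ (DI - CI) @ X.T

# Five points on a line: two of class 0, two of class 1, one unlabeled.
X = np.array([[0.0, 0.1, 1.0, 1.1, 5.0],
              [0.0, 0.0, 0.0, 0.0, 0.0]])
y = np.array([0, 0, 1, 1, -1])
CI, CE = neighbor_costs(X, y, k=1)
C_dne, B_dne = dne_spec(X, CI, CE)
C_mfa, B_mfa = mfa_spec(X, CI, CE)
```

Rows and columns of unlabeled points stay at zero, as required by the definition, and both matrices are symmetric because an entry is set whenever either point is among the other's neighbors.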

Things are not quite obvious in the case of LFDA. In LFDA, the constraint
matrix $B$ is designed to satisfy the condition (1*) since elements of
$C^{wit}$ are proportional to those of $C^I$; nevertheless, since weights are
inversely proportional to $n_k$, elements in a small class have larger weights
than elements in a bigger class, i.e. a pair in a small class is more likely to
satisfy the condition (1*) than a pair in a bigger class. To understand
$C^{\ell}$, we recall that

$$\begin{aligned}
\mathrm{trace}(AX(D^{\ell} - C^{\ell})X^T A^T)
&= \frac{1}{2} \sum_{i,j=1}^{n} c^{\ell}_{ij} \|Ax_i - Ax_j\|^2 \\
&= \frac{1}{2} \sum_{y_i = y_j} c^I_{ij} \Big(\frac{1}{n_k} - \frac{1}{n}\Big) \|Ax_i - Ax_j\|^2
 - \frac{1}{2} \sum_{y_i \neq y_j} \frac{1}{n} \|Ax_i - Ax_j\|^2 \\
&= d - \frac{1}{2} \sum_{y_i = y_j} \frac{c^I_{ij}}{n} \|Ax_i - Ax_j\|^2
 - \frac{1}{2} \sum_{y_i \neq y_j} \frac{1}{n} \|Ax_i - Ax_j\|^2,
\end{aligned}$$

where at the third equality we use the constraint $ABA^T = I$ with
$B = X(D^{wit} - C^{wit})X^T$ and hence

$$\mathrm{trace}(AX(D^{wit} - C^{wit})X^T A^T)
= \frac{1}{2} \sum_{y_i = y_j} \frac{c^I_{ij}}{n_k} \|Ax_i - Ax_j\|^2
= \mathrm{trace}(I) = d.$$

Hence, we observe that every pair of labeled examples coming from different
classes has a corresponding cost of $-\frac{1}{n}$. Therefore, $C^{\ell}$ is
designed to penalize an embedded space which does not satisfy the condition
(2). Surprisingly, in LFDA, nearby examples of the same class (having
$c^I_{ij} = 1$) also have a cost of $-\frac{1}{n}$. As a cost proportional to
$-\frac{1}{n}$ is meant to preserve a pairwise distance between each pair of
examples (see Section 3.1), LFDA tries to preserve a local geometrical
structure between each pair of nearby examples of the same class, in contrast
to DNE and MFA which try to squeeze nearby examples of the same class into a
single point.

We note that other recent supervised methods for manifold learning can also
be presented and interpreted in our framework with different specifications of
$C^{\ell}$, for example, Local Discriminant Embedding of Chen et al. [S] and
Supervised Nonlinear Local Embedding of Cheng et al. [6].

2.1.2 The Cost Matrix $C^{u}$ and the Hadamard Power Operator

One important implication of the manifold assumption is that "nearby examples
are likely to belong to the same class". Hence, by the assumption, it makes
sense to design $C^{u}$ such that it prevents any pair of nearby examples from
staying far apart in an embedded space.

Among methods of extracting the neighborhood information to define $C^{u}$,
methods based on the heat kernel (or the Gaussian function) are most popular.
Besides the heat kernel, other methods of defining $C^{u}$ have been invented;
see [T51 Chap. 15] and [17] for more details. The simplest specification of
nearby examples based on the heat kernel is:

$$c^{u}_{ij} = \exp\Big(-\frac{\|x_i - x_j\|^2}{\sigma^2}\Big). \qquad (6)$$

Each pair of nearby examples will be penalized with different costs depending
on their similarity, and the similarity between two points is based on the
Euclidean distance between them in the input space. Incidentally, with this
specification of $C^{u}$, the term $f^{u}(AX)$ in Eq. (1) can be interpreted as
an approximation of the Laplace-Beltrami operator on a data manifold. A learner
which employs $C = C^{u}$ ($C^{\ell} = 0$) is named Locality Preserving
Projection (LPP) [S].

The parameter $\sigma$ is crucial as it controls the scale of a cost
$c^{u}_{ij}$. Hence, the choice of $\sigma$ must be sensible. Moreover, an
appropriate choice of $\sigma$ may vary across the support of $\mathcal{P}_X$.
Hence, a local scale $\sigma_i$ for each point $x_i$ should be used. Let $x'_i$
be the $k$-th nearest neighbor of $x_i$. A local scale is defined as

$$\sigma_i = \|x'_i - x_i\|,$$

and a weight of each edge is then defined as

$$c^{u}_{ij} = \exp\Big(-\frac{\|x_i - x_j\|^2}{\sigma_i \sigma_j}\Big). \qquad (7)$$

This local scaling method has proven to be effective in previous experiments
[P5 j] on clustering. A specification of $k$ to define the local scale of each
point is usually more convenient than a specification of $\sigma$ since the
space of possible choices of $k$ is considerably smaller than that of
$\sigma$.

Instead of proposing yet another method to specify a cost matrix, here we
present a novel method which can be used to modify any existing cost matrix.
Let $Q$ and $R$ be two matrices of equal size with elements $q_{ij}$ and
$r_{ij}$, respectively. Recall that the Hadamard product $P$ [2J between $Q$
and $R$, $P = Q \odot R$, has elements $p_{ij} = q_{ij} r_{ij}$. In words, the
Hadamard product is a pointwise product between two matrices. Here, we define
the Hadamard $\alpha$-th power operator as

$$\odot^{\alpha} Q = \underbrace{Q \odot Q \odot \dots \odot Q}_{\alpha \text{ times}}. \qquad (8)$$

Given a cost matrix $C^{u}$ and a positive integer $\alpha$, we define a new
cost matrix $C^{u_\alpha}$ as

$$C^{u_\alpha} = \frac{\|C^{u}\|_F}{\|\odot^{\alpha} C^{u}\|_F}\, \odot^{\alpha} C^{u}, \qquad (9)$$

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The multiplication
by $\|C^{u}\|_F / \|\odot^{\alpha} C^{u}\|_F$ makes
$\|C^{u_\alpha}\|_F = \|C^{u}\|_F$. Note that if $C^{u}$ is symmetric and
non-negative, $C^{u_\alpha}$ still has these properties.

The intuition of $\odot^{\alpha}$ will be explained through experiments in
Section 4 where we show that $C^{u_\alpha}$ can further improve the quality of
$C^{u}$ so that the classification performance of a semi-supervised learner is
increased.

Any combination of a label cost matrix $C^{\ell}$ of those in previous works
such as DNE, MFA and LFDA with an unlabel cost matrix $C^{u}$ results in a new
SSL algorithm, and we will call the new algorithms SS-DNE, SS-MFA and SS-LFDA.
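
The unlabel cost matrices of this subsection can be sketched with NumPy as below. The sketch is hedged in several ways: the function names are hypothetical, the heat kernel is assumed to take the form $\exp(-\|x_i - x_j\|^2 / \sigma^2)$, the local scale is taken as the distance to the $k$-th nearest neighbor, and the data are assumed to contain no duplicate points (otherwise a local scale could be zero).

```python
import numpy as np

def sq_dists(X):
    """Squared Euclidean distances between the columns of X (shape (d0, n))."""
    return np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)

def heat_kernel_cost(X, sigma):
    """Global-scale heat kernel cost, assuming exp(-||xi - xj||^2 / sigma^2)."""
    return np.exp(-sq_dists(X) / sigma ** 2)

def local_scaling_cost(X, k):
    """Local scaling: sigma_i is the distance from x_i to its k-th neighbor."""
    sq = sq_dists(X)
    # After sorting each row, column 0 is the point itself (distance 0).
    sigma = np.sqrt(np.sort(sq, axis=1)[:, k])
    return np.exp(-sq / np.outer(sigma, sigma))

def hadamard_power(C, alpha):
    """Elementwise alpha-th power, rescaled to keep the Frobenius norm of C."""
    P = C ** alpha
    return (np.linalg.norm(C, "fro") / np.linalg.norm(P, "fro")) * P

# Three one-dimensional points stored as columns.
X = np.array([[0.0, 1.0, 3.0]])
C_heat = heat_kernel_cost(X, sigma=1.0)
C_local = local_scaling_cost(X, k=1)
C_pow = hadamard_power(C_heat, alpha=3)
```

Raising the entries to a power larger than one shrinks small similarities faster than large ones, so after rescaling the cost mass concentrates on the closest pairs. The diagonal entries are left in place here; they cancel in $D - C$ but do enter the Frobenius norms.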



2.2 Non-Linear Parameterization Using the KPCA Trick 

By the linear parameterization, however, we can only obtain a linear subspace
defined by $A$. Learning a non-linear subspace can be accomplished by the
standard kernel trick [25]. However, applying the kernel trick can be
inconvenient since new mathematical formulas have to be derived and new
implementations have to be done separately from the linear implementations.
Recently, Chatpatanasiri et al. [16] have proposed an alternative kernelization
framework called the KPCA trick, which does not require a user to derive a new
mathematical formula or re-implement a kernelized algorithm. Moreover, the KPCA
trick framework avoids troublesome problems such as singularity.

2.2.1 The KPCA-Trick Algorithm

In this section, the KPCA trick framework is extended to cover learners
implemented under our semi-supervised learning framework. Let $k(\cdot, \cdot)$
be a PSD kernel function associated with a non-linear function
$\phi(\cdot) : \mathbb{R}^{d_0} \to \mathcal{H}$ such that
$k(x, x') = \langle \phi(x), \phi(x') \rangle$ [26] where $\mathcal{H}$ is a
Hilbert space. Denote $\phi_i$ for $\phi(x_i)$ for $i = 1, \dots, \ell + u$ and
$\phi'$ for $\phi(x')$. The central idea of the KPCA trick is to represent each
$\phi_i$ and $\phi'$ in a new "finite"-dimensional space, with dimensionality
bounded by $\ell + u$, without any loss of information. Within the framework,
a new coordinate of each example is computed "explicitly", and each example
in the new coordinate is then used as the input of any existing semi-supervised
learner without any re-implementations.

To simplify the discussion, we assume that $\{\phi_i\}$ is linearly independent
and has its center at the origin, i.e. $\sum_i \phi_i = 0$. Since we have
$n = \ell + u$ total examples, the span of $\{\phi_i\}$ has dimensionality $n$
by our assumption. According to [16], each example $\phi_i$ can be represented
as $\varphi_i \in \mathbb{R}^n$ with respect to a new orthonormal basis
$\{\psi_i\}_{i=1}^{n}$ such that $\mathrm{span}(\{\psi_i\}_{i=1}^{n})$ is the
same as $\mathrm{span}(\{\phi_i\}_{i=1}^{n})$ without loss of any information.
More precisely, we define

$$\varphi_i = (\langle \phi_i, \psi_1 \rangle, \dots, \langle \phi_i, \psi_n \rangle) = \Psi^T \phi_i, \qquad (10)$$

where $\Psi = (\psi_1, \dots, \psi_n)$. Note that although we may be unable to
numerically represent each $\psi_i$, an inner product
$\langle \phi_i, \psi_j \rangle$ can be conveniently computed by KPCA where
each $\psi_i$ is a principal component in the feature space. Likewise, a new
test point $\phi'$ can be mapped to $\varphi' = \Psi^T \phi'$. Consequently, the
mapped data $\{\varphi_i\}$ and $\varphi'$ are finite-dimensional and can be
explicitly computed.

The KPCA-trick algorithm consisting of three simple steps is shown in
Figure 3. All semi-supervised learners can be kernelized by this simple
algorithm. In the algorithm, we denote a semi-supervised learner by ssl which
outputs the best linear map $A^*$.



Input:  1. training examples: $\{(x_1, y_1), \dots, (x_\ell, y_\ell), x_{\ell+1}, \dots, x_{\ell+u}\}$
        2. a new example: $x'$
        3. a kernel function: $k(\cdot, \cdot)$
        4. a linear semi-supervised learning algorithm: ssl (see Figure 1)
Algorithm:
   (1) Apply kpca($k$, $\{x_i\}_{i=1}^{\ell+u}$, $x'$) such that
       $\{x_i\} \mapsto \{\varphi_i\}$ and $x' \mapsto \varphi'$.
   (2) Apply ssl with new inputs
       $\{(\varphi_1, y_1), \dots, (\varphi_\ell, y_\ell), \varphi_{\ell+1}, \dots, \varphi_{\ell+u}\}$
       to achieve $A^*$.
   (3) Perform kNN based on the distance $\|A^* \varphi_i - A^* \varphi'\|$.

Figure 3: The KPCA-trick algorithm for semi-supervised learning.
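
Step (1) of Figure 3 can be sketched with NumPy as follows. This is an illustrative sketch rather than the reference implementation of the KPCA trick: the names `kpca_fit` and `kpca_transform` are hypothetical, the feature-space centering is done explicitly (the text above simply assumes centered data), and directions with numerically zero eigenvalues are dropped.

```python
import numpy as np

def kpca_fit(K, tol=1e-10):
    """K: (n, n) kernel matrix of all labeled and unlabeled training points.
    Returns a model dict and Phi, the (r, n) matrix of explicit coordinates."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # centered kernel matrix
    evals, V = np.linalg.eigh((Kc + Kc.T) / 2.0)
    keep = evals > tol * evals.max()             # drop null directions
    W = V[:, keep] / np.sqrt(evals[keep])        # unit-norm principal components
    Phi = (Kc @ W).T                             # coordinates of training points
    return {"K": K, "H": H, "W": W}, Phi

def kpca_transform(model, k_new):
    """k_new: (n,) vector of kernel values k(x_i, x') for a new example x'."""
    K, H, W = model["K"], model["H"], model["W"]
    n = K.shape[0]
    kc = H @ (k_new - K @ np.ones(n) / n)        # center against the training set
    return W.T @ kc

# Hypothetical data with the 2nd-degree polynomial kernel <x, x'>^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 7))
K = (X.T @ X) ** 2
model, Phi = kpca_fit(K)
phi_first = kpca_transform(model, K[:, 0])       # re-map the first training point
```

The columns of `Phi` can then be passed to any linear learner in place of the original inputs (step (2)), and `kpca_transform` maps a new example into the same coordinates before the kNN step (step (3)). Inner products between columns of `Phi` reproduce the centered kernel matrix, which is the sense in which no information is lost.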



2.3 Remarks

1. The main optimization problem shown in Eq. (2) can be restated as follows:

$$\arg\min_{A \in \mathbb{R}^{d \times d_0}} \mathrm{trace}\big((ABA^T)^{-1} AX(D - C)X^T A^T\big).$$

Within this formulation, the corresponding optimal solution is invariant under
a non-singular linear transformation; i.e., if $A^*$ is an optimal solution,
then $TA^*$ is also an optimal solution for any non-singular
$T \in \mathbb{R}^{d \times d}$ [pp. 447]. We note that four choices of $T$
which assign a weight to each new axis are natural: (1) $T = I$, (2) $T$ is a
diagonal matrix with $T_{ii} = 1 / \|a^{(i)}\|$, i.e. $T$ normalizes each axis
to be equally important, (3) $T$ is a diagonal matrix with
$T_{ii} = \sqrt{\lambda_i}$ as $\lambda_i$ determines how well each axis
$a^{(i)}$ fits the objective function $a^{(i)T} X(D - C)X^T a^{(i)}$, and (4)
$T$ is a diagonal matrix with $T_{ii} = \sqrt{\lambda_i} / \|a^{(i)}\|$, i.e. a
combination of (2) and (3).

2. The matrices $B$ defined in Subsection 2.1 of the two algorithms, SS-MFA
and SS-LFDA, are guaranteed to be positive semidefinite (PSD) but may not be
positive definite (PD), i.e., $B$ may not be full-rank. In this case, $B$ is
singular and we cannot immediately apply Eq. (3) to solve the optimization
problems. One common way to solve this difficulty is to use
$(B + \epsilon I)$, for some positive value of $\epsilon$, which is now
guaranteed to be full-rank, instead of $B$ in Eq. (3). Since $\epsilon$ acts in
the role of a regularizer, it makes sense to set $\epsilon = \gamma$, the
regularization parameter specified in Section 2.1. Similar settings of
$\epsilon$ have also been used by some existing algorithms, e.g. [27] [14].

Also, in a small sample size problem where $X(D - C)X^T$ is not full-rank, the
obtained matrix $A^*$ (or some columns of $A^*$) lies in the null space of
$X(D - C)X^T$. Although this matrix does optimize our optimization problem, it
usually overfits the given data. One possible solution to this problem is to
apply PCA to the given data in the first place [28] so that the resulting data
have dimensionality less than or equal to the rank of $X(D - C)X^T$. Note that
in our KPCA trick framework this pre-processing is automatically accomplished
as KPCA has to be applied to a learner as shown in Figure 3.
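
The effect of replacing $B$ by $B + \epsilon I$ can be illustrated on a deliberately rank-deficient matrix. The sketch below is only an illustration under assumed toy data; the function name `regularized_constraint` and the value `eps=0.1` are hypothetical choices.

```python
import numpy as np

def regularized_constraint(B, eps):
    """Return B + eps * I, which is positive definite if B is PSD and eps > 0."""
    B = (B + B.T) / 2.0
    return B + eps * np.eye(B.shape[0])

# A small-sample situation: d0 = 6 input dimensions but only n = 3 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
B = X @ X.T                              # PSD with rank at most 3, hence singular
B_reg = regularized_constraint(B, eps=0.1)

rank_before = np.linalg.matrix_rank(B)
smallest_eig_after = np.linalg.eigvalsh(B_reg).min()
L = np.linalg.cholesky(B_reg)            # succeeds because B_reg is PD
```

Every eigenvalue of the regularized matrix is shifted up by `eps`, so its smallest eigenvalue is at least `eps` (up to round-off) and a Cholesky-based or generalized eigen-solver can be applied.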



3 Related Work: Connection and Improvement

As we already described in Section 2.1, our framework generalizes various
existing supervised and unsupervised manifold learners
[HI [H [2 O HI [13 [H [23]. The KPCA trick is new in the field of
semi-supervised learning.

There are some supervised manifold learners which cannot be represented
in our framework [TBI [HI [IHl [HI [12] because their cost functions are not
linear with respect to distances among examples. Extension of these algorithms
to handle semi-supervised learning problems is an interesting future work.

Yang et al. [29] present another semi-supervised learning framework which
solves entirely different problems from those considered in this paper. They
propose to extend unsupervised algorithms such as ISOMAP [7] and Laplacian
Eigenmap [13, Chapter 16] to cases in which information about exact locations
of some points is available.

To the best of our knowledge, there are currently two existing semi-supervised
dimensionality reduction frameworks in the literature which have a similar goal
to ours; both were proposed very recently. Here, we show that these frameworks
can be restated as special cases of our framework.

3.1 Sugiyama et al. [14]

Sugiyama et al. [14] extend the LFDA algorithm to handle a semi-supervised
learning problem by adding the PCA objective function $f^{PCA}(A)$ into the
objective function $f^{\ell}(A)$ of LFDA described in Section 2.1.1. To
describe Sugiyama et al.'s algorithm, namely 'SELF', without loss of
generality, we assume that training data are centered at the origin, i.e.
$\sum_{i=1}^{n} x_i = 0$, and then we can write
$f^{PCA}(A) = -\sum_{i=1}^{n} \|Ax_i\|^2$. Sugiyama et al. propose to solve the
following problem:

(11)

Interestingly, it can be shown that this formulation can be formulated in our
framework with the unlabel cost $c^{u}_{ij}$ being negative, and hence our
framework subsumes SELF. To see this, let $c^{u}_{ij} = -1/2n$, for all
$i, j = 1, \dots, n$. Then, the objective $f^{u}(A)$ is equivalent to
$f^{PCA}(A)$:



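$$\begin{aligned}
f^{u}(A) &= \sum_{i,j=1}^{n} -\frac{1}{2n} \|Ax_i - Ax_j\|^2
 = -\frac{1}{2n} \sum_{i,j=1}^{n} \langle Ax_i - Ax_j, Ax_i - Ax_j \rangle \\
&= -\frac{1}{2n} \Big( 2 \sum_{i,j=1}^{n} \langle Ax_i, Ax_i \rangle
 - 2 \sum_{i,j=1}^{n} \langle Ax_i, Ax_j \rangle \Big) \\
&= -\frac{1}{2n} \Big( 2n \sum_{i=1}^{n} \|Ax_i\|^2
 - 2 \Big\langle A \sum_{i=1}^{n} x_i, A \sum_{j=1}^{n} x_j \Big\rangle \Big) \\
&= -\sum_{i=1}^{n} \|Ax_i\|^2 = f^{PCA}(A),
\end{aligned}$$

where we use the fact that $\sum_{i=1}^{n} x_i = 0$. This proves that SELF is a
special case of our framework.

Note that the use of negative unlabel costs $c^{u}_{ij} = -1/2n$ results in an
algorithm which tries to preserve a global structure of the input data and does
not reflect the manifold assumption where only a local structure should be
preserved. Therefore, when the input unlabeled data lie on a complicated
manifold, it is not appropriate to apply $f^{u}(A) = f^{PCA}(A)$.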
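
The equivalence between the constant negative cost $-1/2n$ and the PCA objective can be verified numerically for centered data. The following NumPy sketch is only a sanity check on hypothetical random data; the function names are illustrative.

```python
import numpy as np

def f_u_constant_cost(A, X):
    """Sum over all pairs of (-1 / 2n) * ||A x_i - A x_j||^2."""
    n = X.shape[1]
    Z = A @ X
    sq = np.sum((Z[:, :, None] - Z[:, None, :]) ** 2, axis=0)
    return float(np.sum(-sq / (2.0 * n)))

def f_pca(A, X):
    """PCA objective -sum_i ||A x_i||^2, valid for data centered at the origin."""
    return float(-np.sum((A @ X) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 9))
X = X - X.mean(axis=1, keepdims=True)     # center the data at the origin
A = rng.normal(size=(2, 4))

print(f_u_constant_cost(A, X), f_pca(A, X))
```

The two values coincide only because the data are centered; without the centering step the cross term in the derivation does not vanish.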



3.2 Song et al. [15]

Song et al. propose to extend FDA and another algorithm named maximum
margin criterion (MMC) [30] to handle a semi-supervised learning problem.
Their idea of semi-supervised learning extension is similar to ours as they add
the term $f^{u}(\cdot)$ into the objective of FDA and MMC (hence, we call them
SS-FDA and SS-MMC, respectively). However, SS-FDA and SS-MMC cannot handle
problems where data of each class form a manifold or several clusters as shown
in Figure 2 because SS-FDA and SS-MMC satisfy the condition (1) but not (1*).
In fact, SS-FDA and SS-MMC can both be restated as instances of our framework.
To see this, we note that the optimization problem of SS-MMC can be stated as

$$A^* = \arg\min_{AA^T = I} \gamma'\, \mathrm{trace}(A S_w A^T) - \mathrm{trace}(A S_b A^T) + \gamma f^{u}(A), \qquad (12)$$

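where $S_b$ and $S_w$ are standard between-class and within-class scatter
matrices, respectively [12]:

$$S_w = \sum_{i=1}^{c} \sum_{j : y_j = i} (x_j - \mu_i)(x_j - \mu_i)^T
\quad \text{and} \quad
S_b = \sum_{i=1}^{c} n_i (\mu - \mu_i)(\mu - \mu_i)^T,$$

where $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$,
$\mu_i = \frac{1}{n_i} \sum_{j : y_j = i} x_j$ and $n_i$ is the number of
examples in the $i$-th class. It can be checked that
$\mathrm{trace}(A S_w A^T) = \sum_{i,j=1}^{n} c^{w}_{ij} \|Ax_i - Ax_j\|^2$ and
$\mathrm{trace}(A S_b A^T) = \sum_{i,j=1}^{n} c^{b}_{ij} \|Ax_i - Ax_j\|^2$ where

$$c^{b}_{ij} = \begin{cases} \frac{1}{2}\left(\frac{1}{n} - \frac{1}{n_k}\right), & \text{if } y_i = y_j = k, \\ \frac{1}{2n}, & \text{otherwise,} \end{cases}
\quad \text{and} \quad
c^{w}_{ij} = \begin{cases} \frac{1}{2 n_k}, & \text{if } y_i = y_j = k, \\ 0, & \text{otherwise.} \end{cases}$$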
Figure 4: The first toy example. The projection axes of three algorithms, namely
FDA, LFDA, and LPP, are presented. Big circles and big crosses denote labeled
examples while small circles and small crosses denote unlabeled examples. Their
accuracies over the unlabeled examples are shown at the top of the plot
(LFDA: 0.935, FDA: 0.820, LPP: 0.620).

Hence, by setting $c^{\ell}_{ij} = \gamma' c^{w}_{ij} - c^{b}_{ij}$ we finish
our proof that SS-MMC is a special case of our framework. The proof that SS-FDA
is in our framework is similar to that of SS-MMC.
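
The pairwise-distance form of the scatter matrices can be checked numerically. The sketch below is a hedged illustration, not the original proof: it assumes that all examples are labeled, that $S_b$ weights each class by its size $n_i$, and that the costs are $c^{w}_{ij} = 1/(2 n_k)$ within a class and $c^{b}_{ij} = 1/(2n) - c^{w}_{ij}$; the function names are hypothetical.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class and between-class scatter of the columns of X."""
    d0 = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)
    Sw = np.zeros((d0, d0))
    Sb = np.zeros((d0, d0))
    for k in np.unique(y):
        Xk = X[:, y == k]
        mk = Xk.mean(axis=1, keepdims=True)
        Sw += (Xk - mk) @ (Xk - mk).T
        Sb += Xk.shape[1] * (mu - mk) @ (mu - mk).T
    return Sw, Sb

def mmc_costs(y):
    """Pairwise costs reproducing trace(A Sw A^T) and trace(A Sb A^T)."""
    n = len(y)
    same = y[:, None] == y[None, :]
    nk = np.array([np.sum(y == yi) for yi in y], dtype=float)  # class size per point
    Cw = np.where(same, 1.0 / (2.0 * nk[:, None]), 0.0)
    Cb = 1.0 / (2.0 * n) - Cw
    return Cw, Cb

def pairwise_sum(A, X, C):
    Z = A @ X
    sq = np.sum((Z[:, :, None] - Z[:, None, :]) ** 2, axis=0)
    return float(np.sum(C * sq))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 9))
y = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2])
A = rng.normal(size=(2, 3))
Sw, Sb = scatter_matrices(X, y)
Cw, Cb = mmc_costs(y)
```

With these definitions, `pairwise_sum(A, X, Cw)` should match `np.trace(A @ Sw @ A.T)` and likewise for the between-class pair, which is the statement used to place SS-MMC inside the framework.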

3.3 Improvement over Previous Frameworks 

In this section, we explain why SELF and SS-FDA proposed by Sugiyama et al.
[14] and Song et al. [15] described above are not enough to solve some
semi-supervised learning problems, even simple ones shown in Figure 4 and
Figure 5.

In Figure 4, three dimensionality reduction algorithms, FDA, LFDA and
LPP are performed on this dataset. Because of multi-modality, FDA cannot 
find an appropriate projection. Since the two clusters do not contain data of 
the same class, LPP which tries to preserve the structure of the two clusters 
also fails. In this case, only LFDA can find a proper projection since it can cope 
with multi-modality and can take into account the labeled examples. Note that 
since SS-FDA is a linearly combined algorithm of FDA and LPP, it can only 
find a projection lying in between the projections discovered by FDA and LPP, 
and in this case SS-FDA cannot find an efficient projection, unlike LFDA and, 
of course, SS-LFDA derived from our framework. 

A similar argument can be given to warn against a careless use of SELF in some
situations. In Figure 5, four dimensionality reduction algorithms, FDA, PCA,
LFDA and LPP, are performed on the dataset. Because of multi-modality, FDA
and PCA cannot find an appropriate projection. Also, since there are only a
few labeled examples, LFDA fails to find a good projection as well. In this
case, only LPP can find a proper projection since it can cope with
multi-modality and can take the unlabeled examples into account. Note that
since SELF is a linearly combined algorithm of LFDA and PCA, it can only find a
projection lying in between the projections discovered by LFDA and PCA, and in
this case SELF cannot find a correct projection, unlike a semi-supervised
learner like SS-LFDA derived from our framework which, as explained in
Section 2.1, employs the LPP cost function as its $C^{u}$.

Figure 5: The second toy example consisting of three clusters of two classes.

Since a semi-supervised manifold learner derived from our framework can be
intuitively thought of as a combination of a supervised learner and an
unsupervised learner, one may misunderstand that a semi-supervised learner
cannot discover a good subspace if neither a supervised learner nor an
unsupervised learner is able to discover a good subspace. The above two toy
examples may also mislead the readers in that way. In fact, that intuition is
incorrect. Here, we give another toy example shown in Figure 6 where only a
semi-supervised learner is able to discover a good subspace but neither its
supervised nor its unsupervised counterpart is. Intuitively, a semi-supervised
learner is able to exploit useful information from both labeled and unlabeled
examples.

4 Experiments 

In this section, classification performances of algorithms derived from our
framework are demonstrated. We try to use a similar experimental setting to
those in previous works [HI [13, Chapter 21] so that our results can be
compared to them.

4.1 Experimental Setting 

In all experiments, two semi-supervised learners derived from our framework,
SS-LFDA and SS-DNE, are compared to relevant existing algorithms: PCA,
LPP*, LFDA, DNE and SELF [14]. In contrast to the standard LPP, which
does not apply the Hadamard power operator explained in Section 2.1, we
denote by LPP* the variant of LPP that applies the Hadamard power operator.

Non-linear semi-supervised manifold learning is also evaluated by applying
the KPCA trick algorithm illustrated in Figure 3. Since it is not our intention
to apply the "best" kernel but to compare the performance of a "semi-supervised"
kernel learner with its base "supervised" (and "unsupervised") kernel learners,
we simply apply the second-degree polynomial kernel
$k(x, x') = \langle x, x' \rangle^2$ to the kernel algorithms in all experiments.
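
As a minimal sketch of the kernel named above, the following NumPy snippet computes the Gram matrix of $k(x, x') = \langle x, x' \rangle^2$ and checks it against an explicit feature map in two dimensions. The function names `poly2_kernel` and `phi`, and the toy points, are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def poly2_kernel(X, Y=None):
    """Second-degree homogeneous polynomial kernel k(x, x') = <x, x'>^2."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    return (X @ Y.T) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs such that <phi(x), phi(x')> = <x, x'>^2."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

# Hypothetical toy inputs (not from the paper): three points in R^2.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
K = poly2_kernel(X)

# The same Gram matrix obtained through the explicit feature space.
Phi = np.array([phi(x) for x in X])
K_explicit = Phi @ Phi.T
```

The agreement between `K` and `K_explicit` illustrates why a learner using this kernel implicitly works in the space of degree-2 monomials of the inputs.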

Figure 6: (Left) The third toy example where only a semi-supervised learner is
able to find a good projection. (Right) An undirected graph corresponding to
the values of $C^u$ used by LPP and SS-LFDA. In this figure, a pair of examples
$i$ and $j$ has a link if and only if $c^u_{ij} > 0.1$. This graph explains why LPP projects
the data in the axis shown in the left figure; LPP, which does not apply the label 
information, tries to choose a projection axis which squeezes the two clusters as 
much as possible. Note that we apply a local-scaling method, Eq.©, to specify 



By using the nearest neighbor algorithm on the discovered subspaces,
classification performances of the tested learners are measured on five
standard datasets shown in Table 1. The first two datasets are obtained from
the UCI repository [31]. The next two datasets, mainly designed for testing
semi-supervised learners, are obtained from
http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html [13, Chapter 21].
The final dataset, extended Yale B [32], is a standard dataset for face
recognition. The classification performance of each algorithm is measured by
the average test accuracy over 25 realizations of randomly splitting each
dataset into training and testing subsets.
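
The evaluation protocol described above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names (`one_nn_accuracy`, `average_test_accuracy`, `fit_projection`) are hypothetical, the unlabeled examples used by the semi-supervised learners are omitted, and an identity projection stands in for a learned one.

```python
import numpy as np

def one_nn_accuracy(Z_train, y_train, Z_test, y_test):
    """Test accuracy of the 1-nearest-neighbor rule (squared Euclidean distance)."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    pred = y_train[d2.argmin(axis=1)]
    return float((pred == y_test).mean())

def average_test_accuracy(X, y, n_train, fit_projection, n_splits=25, seed=0):
    """Average 1-NN test accuracy over random train/test splits."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        tr, te = perm[:n_train], perm[n_train:]
        A = fit_projection(X[tr], y[tr])  # (d0 x d) projection matrix
        accs.append(one_nn_accuracy(X[tr] @ A, y[tr], X[te] @ A, y[te]))
    return float(np.mean(accs))

# Hypothetical toy data (not from the paper): two well-separated blobs in R^2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(50.0, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)

# Stand-in for a learned projection: keep the input space unchanged.
identity = lambda X_tr, y_tr: np.eye(X_tr.shape[1])
mean_acc = average_test_accuracy(X, y, n_train=40, fit_projection=identity)
```

Any dimensionality reduction method that returns a projection matrix could be passed as `fit_projection`; the 25 splits match the number of realizations stated in the text.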

Three parameters need to be tuned in order to apply a semi-supervised
learner derived from our framework (see Section 2.1): $\gamma$, the regularizer;
$\alpha$, the degree of the Hadamard power operator; and $k$, the $k$-th nearest
neighbor parameter needed to construct the cost matrices. To make our learners
satisfy the condition (1*) described in Section 2.1, it is clear that $k$ should
be small compared to $n_c$, the number of training examples of class $c$. In our
experience, semi-supervised learners are quite insensitive to the choice among
small values of $k$. Therefore, in all our experiments, we simply set
$k = \min(3, n_c)$ so that only two parameters, $\gamma$ and $\alpha$, need to be
tuned. We tune these two parameters via cross validation. Note that only
$\alpha$ needs to be tuned for LPP* and only $\gamma$ needs to be tuned for SELF.

The 'Good Neighbors' score shown in Table 1 is due to Sugiyama et al.
[14]. The score is simply defined as the training accuracy of the nearest
neighbor algorithm when all available data are labeled and given to the
algorithm. Note that this score is not used by any dimensionality reduction
algorithm; it merely indicates to the readers how useful the unlabeled examples
of each dataset are. Intuitively, if a dataset gets a high score, unlabeled
examples should be useful, since a high score indicates that each pair of
examples having a high penalty cost $c^u_{ij}$ should belong to the same class.
Note that in Table 1 there are two scores for each dataset: linear is the score
on the given input space, while kernel is the score on the feature space
corresponding to the second-degree polynomial kernel.

Table 1: Details of each dataset: $d_0$, $c$, $\ell$, $u$ and $t$ denote the
numbers of input features, classes, labeled examples, unlabeled examples and
testing examples, respectively. '*' denotes the transductive setting used in
small datasets, where all examples which are not labeled are given as unlabeled
examples and are used as testing examples as well. $d$, determined by using
prior knowledge, denotes the target dimensionality for each dataset. "Good
Neighbors" denotes a quantity which measures the usefulness of unlabeled data
for each dataset.

| Name       | $d_0$ | $c$ | $\ell + u + t$ | $\ell$ | $t$ | $d$ | Good Neighbors (linear) | Good Neighbors (kernel) |
|------------|-------|-----|----------------|--------|-----|-----|-------------------------|-------------------------|
| Ionosphere | 34    | 2   | 351            | 10/100 | *   | 2   | 0.866                   | 0.843                   |
| Balance    | 4     | 3   | 625            | 10/100 | 300 | 1   | 0.780                   | 0.760                   |
| BCI        | 117   | 2   | 400            | 10/100 | *   | 2   | 0.575                   | 0.593                   |
| USPS       | 241   | 2   | 1500           | 10/100 | 300 | 10  | 0.969                   | 0.971                   |
| M-Eyale    | 504   | 5   | 320            | 20/100 | *   | 10  | 0.878                   | 0.850                   |
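
A minimal sketch of how a score of this kind could be computed is given below. It assumes a leave-one-out reading of "training accuracy" (each example is classified by its nearest *other* example), since the text does not spell out this detail; the function name `good_neighbors_score` and the toy data are hypothetical.

```python
import numpy as np

def good_neighbors_score(X, y):
    """Fraction of examples whose nearest other example has the same label.

    Assumption: leave-one-out 1-NN accuracy on fully labeled data.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point may not be its own neighbor
    return float((y[d2.argmin(axis=1)] == y).mean())

# Hypothetical 1-D toy data: the last point lies closer to the other class.
X = np.array([[0.0], [0.1], [1.0], [1.1], [0.6]])
y = np.array([0, 0, 1, 1, 0])
score = good_neighbors_score(X, y)  # four of the five points have a same-class neighbor
```

A kernel variant would replace the squared Euclidean distances by distances in the feature space induced by the kernel.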

4.2 Numerical Results 

Numerical results are shown in Table 2 for the case of $\ell = 10$ (except
M-Eyale where $\ell = 20$) and Table 3 for the case of $\ell = 100$. In the
experiments, the classification performances of SS-DNE and SS-LFDA are compared
to those of their unsupervised and supervised counterparts: LPP* and DNE for
SS-DNE, and LPP* and LFDA for SS-LFDA. SELF is also compared to SS-LFDA as they
are related semi-supervised learners originating from LFDA. Our two algorithms
are highlighted if they are superior to their opponents.

From the results, our two algorithms, SS-LFDA and SS-DNE, outperform
all their opponents in 32 out of 40 comparisons: in the first setting of small
$\ell$ (Table 2), our algorithms outperform the opponents in 18 out of 20
comparisons, while in the second setting of large $\ell$ (Table 3), our
algorithms outperform the opponents in 14 out of 20 comparisons. Consequently,
our framework offers semi-supervised learners which, in most comparisons,
improve on their base supervised and unsupervised learners.
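
The comparison counts quoted above can be re-derived from the mean accuracies reported in Tables 2 and 3. The sketch below assumes that "outperform" means a strictly higher mean accuracy than every listed opponent (so a tie does not count) and ignores the ± terms; the variable and function names are illustrative.

```python
# Mean accuracies copied from Tables 2 and 3 (the +/- terms are omitted).
COLS = ["PCA", "LPP*", "DNE", "LFDA", "SELF", "SS-DNE", "SS-LFDA"]
OPPONENTS = {"SS-DNE": ["LPP*", "DNE"], "SS-LFDA": ["LPP*", "LFDA", "SELF"]}

TABLE2 = {  # l = 10 (l = 20 for M-Eyale)
    "linear": {
        "Ionosphere": [71, 82, 70, 71, 70, 75, 78.1],
        "Balance": [49, 61, 63, 70, 69, 71, 73],
        "BCI": [49.8, 53.4, 51.3, 52.6, 52.1, 57.1, 55.2],
        "USPS": [79, 74, 79.6, 80.6, 81.7, 81.8, 83.0],
        "M-Eyale": [44.6, 67, 66, 71.6, 67.2, 76.9, 75.7],
    },
    "kernel": {
        "Ionosphere": [70, 83.2, 70, 71, 74, 87.2, 88],
        "Balance": [41.7, 47.9, 62, 66, 60, 66, 69],
        "BCI": [49.7, 53.7, 50.1, 50.3, 50.5, 53.8, 54.1],
        "USPS": [77, 76, 79.9, 80.3, 80.9, 82.0, 83.7],
        "M-Eyale": [42.1, 63.2, 58.0, 60.3, 58.8, 69.9, 73.2],
    },
}

TABLE3 = {  # l = 100
    "linear": {
        "Ionosphere": [72.8, 83.7, 77.9, 74, 77.8, 84.5, 84.9],
        "Balance": [57, 80, 86.4, 87.9, 87.2, 88.2, 86.3],
        "BCI": [49.5, 54.9, 53.1, 67.9, 67.6, 63.1, 67.5],
        "USPS": [91.4, 75.7, 91.1, 89.3, 92.2, 92.2, 91.6],
        "M-Eyale": [69.4, 84.1, 92.3, 95.4, 94.3, 93.5, 95.7],
    },
    "kernel": {
        "Ionosphere": [79.8, 89.7, 78.7, 81.3, 81.1, 93.6, 93.7],
        "Balance": [42.5, 46.9, 84.0, 87.8, 79, 86.5, 87.7],
        "BCI": [49.7, 54.5, 51.6, 51.0, 52.4, 57.6, 57.0],
        "USPS": [91.1, 81.5, 91.4, 91.2, 92.7, 92.3, 91.9],
        "M-Eyale": [66.3, 81.9, 91.2, 89.1, 85.8, 91.2, 94.3],
    },
}

def count_wins(table):
    """Count comparisons in which our learner strictly beats all its opponents."""
    wins = total = 0
    for rows in table.values():
        for accs in rows.values():
            r = dict(zip(COLS, accs))
            for ours, opps in OPPONENTS.items():
                total += 1
                wins += r[ours] > max(r[o] for o in opps)
    return wins, total

wins_small, total_small = count_wins(TABLE2)
wins_large, total_large = count_wins(TABLE3)
```

Under these assumptions the counts agree with the 18 of 20 and 14 of 20 comparisons stated in the text.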

Note that as the number of labeled examples increases, the usefulness of
unlabeled examples decreases. We discuss and analyze the results for each
dataset in detail in the following subsections.

4.2.1 Ionosphere 

Ionosphere is a real-world dataset of radar pulses passing through the iono- 
sphere which were collected by a system in Goose Bay, Labrador. The targets 
were free electrons in the ionosphere. "Good" radar returns are those showing 
evidence of some type of structure in the ionosphere. "Bad" returns are those 
that do not. Since we do not know the true decision boundary of Ionosphere,
we simply set the target dimensionality $d = c = 2$. It can be observed that
non-linearization improves the classification performance of every algorithm
when $\ell = 100$ (Table 3) and of most algorithms, notably SS-DNE and SS-LFDA,
when $\ell = 10$ (Table 2).

Table 2: Percentage accuracies of SS-DNE and SS-LFDA derived from our
framework compared to existing algorithms ($\ell = 10$, except M-Eyale where
$\ell = 20$). SS-LFDA and SS-DNE are highlighted when they outperform their
opponents (LPP* and DNE for SS-DNE, and LPP*, LFDA and SELF for SS-LFDA).
Superscripts in the original table indicate %-confidence levels of the
one-tailed paired t-test for differences in accuracies between our algorithms
and their best opponents. In this rendering, entries that carry such a
superscript are marked with †; entries without one have confidence levels
below 80%.

| Linear     | PCA     | LPP*    | DNE     | LFDA     | SELF    | SS-DNE   | SS-LFDA  |
|------------|---------|---------|---------|----------|---------|----------|----------|
| Ionosphere | 71±1.2  | 82±1.3  | 70±1.2  | 71±1.1   | 70±1.5  | 75±1.0   | 78.1±.9  |
| Balance    | 49±1.9  | 61±1.9  | 63±2.2  | 70±2.2   | 69±2.3  | 71±1.8†  | 73±2.3†  |
| BCI        | 49.8±.6 | 53.4±.3 | 51.3±.6 | 52.6±.5  | 52.1±.5 | 57.1±.6† | 55.2±.3† |
| USPS       | 79±1.2  | 74±1.0  | 79.6±.6 | 80.6±.9  | 81.7±.8 | 81.8±.5† | 83.0±.5† |
| M-Eyale    | 44.6±.7 | 67±1.1  | 66±1.2  | 71.6±1.0 | 67.2±.8 | 76.9±.8† | 75.7±.9† |

| Kernel     | PCA     | LPP*    | DNE     | LFDA    | SELF    | SS-DNE   | SS-LFDA  |
|------------|---------|---------|---------|---------|---------|----------|----------|
| Ionosphere | 70±1.8  | 83.2±.9 | 70±1.6  | 71±1.3  | 74±1.5  | 87.2±.9† | 88±1.0†  |
| Balance    | 41.7±.8 | 47.9±.9 | 62±2.5  | 66±2.0  | 60±2.8  | 66±1.8†  | 69±1.9†  |
| BCI        | 49.7±.3 | 53.7±.3 | 50.1±.4 | 50.3±.6 | 50.5±.4 | 53.8±.3  | 54.1±.3† |
| USPS       | 77±1.1  | 76±1.1  | 79.9±.5 | 80.3±.8 | 80.9±.8 | 82.0±.4† | 83.7±.6† |
| M-Eyale    | 42.1±.9 | 63.2±.7 | 58.0±.9 | 60.3±.8 | 58.8±.7 | 69.9±.7† | 73.2±.8† |

Table 3: Percentage accuracies of SS-DNE and SS-LFDA compared to existing
algorithms ($\ell = 100$). The † mark has the same meaning as in Table 2.

| Linear     | PCA     | LPP*    | DNE     | LFDA    | SELF    | SS-DNE   | SS-LFDA  |
|------------|---------|---------|---------|---------|---------|----------|----------|
| Ionosphere | 72.8±.6 | 83.7±.6 | 77.9±.7 | 74±1.0  | 77.8±.5 | 84.5±.6† | 84.9±.4† |
| Balance    | 57±2.2  | 80±1.3  | 86.4±.5 | 87.9±.3 | 87.2±.4 | 88.2±.5† | 86.3±.6  |
| BCI        | 49.5±.5 | 54.9±.5 | 53.1±.7 | 67.9±.5 | 67.6±.6 | 63.1±.5† | 67.5±.6  |
| USPS       | 91.4±.3 | 75.7±.3 | 91.1±.3 | 89.3±.4 | 92.2±.3 | 92.2±.4† | 91.6±.3  |
| M-Eyale    | 69.4±.4 | 84.1±.4 | 92.3±.4 | 95.4±.3 | 94.3±.2 | 93.5±.4† | 95.7±.2  |

| Kernel     | PCA     | LPP*    | DNE     | LFDA    | SELF    | SS-DNE   | SS-LFDA  |
|------------|---------|---------|---------|---------|---------|----------|----------|
| Ionosphere | 79.8±.4 | 89.7±.5 | 78.7±.9 | 81.3±.7 | 81.1±.5 | 93.6±.2† | 93.7±.3† |
| Balance    | 42.5±.3 | 46.9±.5 | 84.0±.7 | 87.8±.7 | 79±1.6  | 86.5±.7† | 87.7±.9  |
| BCI        | 49.7±.5 | 54.5±.4 | 51.6±.6 | 51.0±.8 | 52.4±.6 | 57.6±.2† | 57.0±.4† |
| USPS       | 91.1±.3 | 81.5±.6 | 91.4±.4 | 91.2±.4 | 92.7±.3 | 92.3±.3† | 91.9±.3  |
| M-Eyale    | 66.3±.3 | 81.9±.5 | 91.2±.3 | 89.1±.5 | 85.8±.6 | 91.2±.3  | 94.3±.3† |

It can be observed that LPP* is much better than PCA on this dataset,
and therefore, unlike SELF, SS-LFDA much improves LFDA. In fact, the main
reason that SS-LFDA, SS-DNE and LPP* have good classification performances
is the Hadamard power operator. This is explained in Figures 7, 8 and 9.
From Figures 7 and 8, defining "nearby examples" to be a pair of examples
with a link (having $c^u_{ij} > 0.36$), we see that almost every link connects
nearby examples of the same class (i.e. connects good nearby examples). This
indicates that our unlabel cost matrix $C^u$ is quite accurate, as bad nearby
examples rarely have links. In fact, the ratio of good nearby examples to total
nearby examples (in short, the good-nearby-examples ratio) is
$394/408 \approx 0.966$. Nevertheless, if we re-define "nearby examples" to be
a pair of examples having, e.g., $c^u_{ij} > 0.01$, the same ratio reduces to
0.75, as shown in Figure 9 (Left). This indicates that many pairs of examples
having small values of $c^u_{ij}$ are of different classes (i.e. bad nearby
examples).
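
The good-nearby-examples ratio discussed above can be sketched as a short function: among all pairs whose cost exceeds a threshold, it reports the fraction whose two members share a class. The function name `good_nearby_ratio` and the small cost matrix are hypothetical; only the definition of the ratio is taken from the text.

```python
import numpy as np

def good_nearby_ratio(C, y, threshold):
    """Fraction of linked pairs (c_ij > threshold) that belong to the same class."""
    C = np.asarray(C, dtype=float)
    y = np.asarray(y)
    i, j = np.triu_indices_from(C, k=1)  # count each unordered pair once
    linked = C[i, j] > threshold
    same_class = y[i] == y[j]
    n_links = int(linked.sum())
    if n_links == 0:
        return float("nan")
    return float((linked & same_class).sum() / n_links)

# Hypothetical symmetric cost matrix for four examples in two classes.
y = np.array([0, 0, 1, 1])
C = np.array([
    [0.00, 0.90, 0.05, 0.02],
    [0.90, 0.00, 0.30, 0.05],
    [0.05, 0.30, 0.00, 0.80],
    [0.02, 0.05, 0.80, 0.00],
])

ratio_strict = good_nearby_ratio(C, y, 0.36)  # only the two strong same-class links remain
ratio_loose = good_nearby_ratio(C, y, 0.01)   # weak cross-class links are now counted too
```

In this toy case, lowering the threshold admits weak cross-class links and the ratio drops, mirroring the behavior reported for Ionosphere.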

Figure 7: The undirected graph corresponding to $C^u$ constructed on
Ionosphere. Each link corresponds to a pair of nearby examples having
$c^u_{ij} > 0.36$. The threshold 0.36 is chosen merely for ease of visualization.

Since an algorithm derived from our framework minimizes the cost-weighted
average distances of every pair of examples (see Eq. (2) and its derivation),
it is beneficial to further increase the cost of a pair having large $c^u_{ij}$
(since it usually corresponds to a pair of the same class) and decrease the cost
of a pair having small $c^u_{ij}$. From the definition of the Hadamard power
operator (Section 2.1), it can easily be seen that the effect of the operator is
exactly what we need. The good-nearby-examples ratios after applying the
Hadamard power operator with $\alpha = 8$ are illustrated in Figure 9 (Right).
Notice that, after applying the operator, even pairs with small values of
$c^u_{ij}$ are usually of the same class.
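
The effect described above can be illustrated numerically. The sketch below assumes that the costs lie in $[0, 1]$ and that the Hadamard power raises every entry of the cost matrix to the power $\alpha$; the toy matrix and the name `hadamard_power` are hypothetical.

```python
import numpy as np

def hadamard_power(C, alpha):
    """Element-wise (Hadamard) power: every entry c_ij becomes c_ij ** alpha."""
    return np.asarray(C, dtype=float) ** alpha

# Hypothetical cost matrix with one strong (0.9) and two weaker (0.3, 0.1) links.
C = np.array([
    [0.0, 0.9, 0.3],
    [0.9, 0.0, 0.1],
    [0.3, 0.1, 0.0],
])
C8 = hadamard_power(C, 8)

# Relative weight of the strong link against a weaker one, before and after.
ratio_before = C[0, 1] / C[0, 2]   # 0.9 / 0.3
ratio_after = C8[0, 1] / C8[0, 2]  # (0.9 / 0.3) ** 8
```

With $\alpha = 8$ the strong link keeps a non-negligible cost while the weak links become almost zero, so the weighted objective is dominated by pairs that were already close.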

4.2.2 Balance 

Balance is an artificial dataset which was generated to model psychological
experimental results. Each example is classified as having the balance scale tip
to the right, tip to the left, or be balanced. The 4 attributes, containing
integer values from 1 to 5, are left_weight, left_distance, right_weight,
and right_distance. The class is determined by the greater of
(left_distance × left_weight) and (right_distance × right_weight); if they are
equal, the scale is balanced. Therefore, there are $5^4 = 625$ total examples
and 3 classes in this dataset. Moreover, the correct decision surface is a
1-dimensional manifold lying in the feature space corresponding to the
$\langle \cdot, \cdot \rangle^2$ kernel, so we set the target dimensionality
$d = 1$.
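
The construction of Balance described above is simple enough to enumerate directly. The following sketch generates all attribute combinations and applies the stated class rule; the function name `balance_class` and the class labels "L", "B" and "R" are illustrative choices.

```python
import itertools
from collections import Counter

def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Class rule from the text: compare the two weight-times-distance products."""
    left = left_distance * left_weight
    right = right_distance * right_weight
    if left > right:
        return "L"  # scale tips to the left
    if right > left:
        return "R"  # scale tips to the right
    return "B"      # balanced

# Every combination of four attributes taking integer values 1..5.
examples = list(itertools.product(range(1, 6), repeat=4))
counts = Counter(balance_class(*e) for e in examples)
```

The class depends only on the sign of left_weight × left_distance − right_weight × right_distance, a single quantity built from degree-2 monomials, which is consistent with a one-dimensional target in the feature space of the second-degree polynomial kernel.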

Figure 8: Zoom-in on the square area of Figure 7.

This dataset illustrates another flaw of using PCA in a classification task.
After centering, the covariance matrix of the 625 examples is just a multiple
of $I$, the identity matrix. Therefore, any direction is a principal component
with largest variance, and PCA just returns an arbitrary direction! Hence, we
cannot expect much from the classification performance of PCA on this dataset.
Thus, PCA cannot help SELF improve much on the performance of LFDA, and
sometimes SELF degrades the performance of LFDA due to overfitting. In
contrast, SS-LFDA often improves the performance of LFDA. Also, SS-DNE is
able to improve the classification performance of DNE and LPP* in all settings.
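
The claim about the covariance matrix of Balance can be checked numerically. The sketch below enumerates the full 625-example grid and computes the population covariance (dividing by the number of examples, an assumption about the normalization that only affects the scalar multiple).

```python
import itertools
import numpy as np

# All 5^4 = 625 examples: four attributes, each an integer from 1 to 5.
X = np.array(list(itertools.product(range(1, 6), repeat=4)), dtype=float)

Xc = X - X.mean(axis=0)          # centering
cov = Xc.T @ Xc / len(X)         # population covariance matrix (4 x 4)
eigvals = np.linalg.eigvalsh(cov)
```

All four eigenvalues coincide, so no direction has larger variance than any other and the leading principal component is not uniquely defined.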

4.2.3 BCI 

This dataset originates from the development of a Brain-Computer Interface
where a single person performed 400 trials, in each of which he imagined
movements with either the left hand (the 1st class) or the right hand (the 2nd
class). In each trial, electroencephalography (EEG) was recorded from 39
electrodes. An autoregressive model of order 3 was fitted to each of the
resulting 39 time series. Each trial was thus represented by a total of
$117 = 39 \times 3$ fitted parameters. The target dimensionality is set to the
number of classes, $d = c = 2$. Similar to the previous datasets, SS-LFDA and
SS-DNE are usually able to outperform their opponents. Again, PCA is not
appropriate for this real-world dataset, and hence SELF is inferior to SS-LFDA.

4.2.4 USPS 

This benchmark is derived from the famous USPS dataset of handwritten digit
recognition. For each digit, 150 images are randomly drawn. The digits '2' and
'5' are assigned to the first class, and all others form the second class. To
prevent a user from exploiting domain knowledge of the data, each example is
rescaled, noise added, dimension masked and pixel shuffled [13, Chapter 21].
Although there are only 2 classes in this dataset, the original data presumably
form 10 clusters, one for each digit. Therefore, the target dimensionality $d$
is set to 10.

Often, SS-LFDA and SS-DNE outperform their opponents. Nevertheless,
note that SS-LFDA and SS-DNE do not improve much on LFDA and DNE
when $\ell = 100$, because 100 labeled examples are quite enough to discriminate
the data and therefore unlabeled examples offer relatively little information to
semi-supervised learners.

Figure 9: For each number $x$ on the x-axis, the corresponding value on the
y-axis is the ratio between the number of good nearby examples (having
$c^u_{ij} > x$ and belonging to the same class) and the number of nearby
examples (having $c^u_{ij} > x$). The ratios are shown for the Hadamard power
of $C^u$ of degree $\alpha$, where (Left) $\alpha = 1$ (the standard LPP) and
(Right) $\alpha = 8$ (LPP*).



4.2.5 M-Eyale 

This face recognition dataset is derived from extended Yale B [32]. There are 28
human subjects under 9 poses and 64 illumination conditions. In our M-Eyale
(Modified Extended Yale B), we randomly chose ten subjects, 32 images per
subject, from the original dataset and down-sampled each example to a size of
21×24 pixels.

M-Eyale consists of 5 classes where each class consists of images of two
randomly-chosen subjects. Hence, there should be two separated clusters for
each class, and we should be able to see the advantage of algorithms employing
the conditions (1*) and (2*) explained in Section 2.1. In this dataset, the
number of labeled examples of each class is fixed to $\ell/c$ so that examples of all
classes are observed. Since this dataset should consist of ten clusters, the target 
dimensionality is set to d = 10. 

It is clear that LPP* performs much better than PCA in this dataset. Re- 
call that PCA captures maximum-variance directions; nevertheless, in this face 
recognition task, maximum-variance directions are not discriminant directions 
but directions of lighting and posing [5S]. Therefore, PCA captures totally 
wrong directions, and hence PCA degrades the performance of SELF relative to
LFDA. In contrast, LPP* captures local structures in the dataset much better
and discovers much better subspaces. Thus, by combining LPP* with LFDA
and DNE, SS-LFDA and SS-DNE are able to obtain very good performances.



5 Conclusion 

We have presented a unified semi-supervised learning framework for linear and
non-linear dimensionality reduction algorithms. An advantage of our framework
is that it generalizes various existing supervised, unsupervised and
semi-supervised learning frameworks employing spectral methods. Empirical
evidence of satisfactory performance of algorithms derived from our framework
has been reported on standard datasets.